Abstract. Binary jumbled pattern matching asks to preprocess a binary string S in order to answer
queries (i, j) which ask for a substring of S that is of length i and has exactly j 1-bits. This
problem naturally generalizes to vertex-labeled trees and graphs by replacing "substring" with "connected
subgraph". In this paper, we give an $O(n^2/\log^2 n)$-time solution for trees, matching the currently best
bound for (the simpler problem of) strings. We also give an $O(g^{2/3} n^{4/3}/(\log n)^{4/3})$-time solution for
strings that are compressed by a grammar of size g. This solution improves the known bounds when the
string is compressible under many popular compression schemes. Finally, we prove that the problem is
fixed-parameter tractable with respect to the treewidth w of the graph, thus improving the previous
best $n^{O(w)}$ algorithm [ICALP'07].

1 Introduction 

Jumbled pattern matching is an important variant of classical pattern matching with several
applications in computational biology, ranging from alignment [4] and SNP discovery [6], to the
interpretation of mass spectrometry data [9] and metabolic network analysis [21]. In the most basic
case of strings, the problem asks to determine whether a given pattern P can be rearranged so
that it appears in a given text T. That is, whether T contains a substring of length |P| where each
letter of the alphabet occurs the same number of times as in P. Using a straightforward sliding
window algorithm, such a jumbled occurrence can be found optimally in O(n) time on a text of
length n. While jumbled pattern matching has a simple efficient solution, its indexing problem is
much more challenging. In the indexing problem, we preprocess a given text T so that on queries
P we can determine quickly whether T has a jumbled occurrence of P. Very little is known about
this problem besides the trivial naive solution.
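
The sliding window algorithm mentioned above can be sketched as follows. This is only a minimal illustration, not code from any of the cited works; the function name `has_jumbled_occurrence` and the bookkeeping through a `balance` counter are choices made here for the example.

```python
from collections import Counter


def has_jumbled_occurrence(text: str, pattern: str) -> bool:
    """Return True if some substring of text of length |pattern| is a permutation of pattern."""
    m, n = len(pattern), len(text)
    if m > n:
        return False

    # balance[c] = (occurrences of c in the window) - (occurrences of c in the pattern)
    balance = Counter()
    for c in pattern:
        balance[c] -= 1
    nonzero = len(balance)  # number of letters whose balance is not zero

    def bump(c: str, delta: int) -> None:
        nonlocal nonzero
        if balance[c] == 0:
            nonzero += 1
        balance[c] += delta
        if balance[c] == 0:
            nonzero -= 1

    for k in range(n):
        bump(text[k], +1)          # letter entering the window
        if k >= m:
            bump(text[k - m], -1)  # letter leaving the window
        if k >= m - 1 and nonzero == 0:
            return True
    return m == 0
```

Each position of the text enters and leaves the window once, so the scan takes O(n) time in addition to the O(|P|) time spent reading the pattern.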

Most of the interesting results on indexing for jumbled pattern matching relate to binary strings
(where a query pattern (i, j) asks for a substring of T that is of length i and has j 1s). Given
a binary string of length n, Cicalese, Fici and Lipták [13] showed how one can build in $O(n^2)$
time an O(n)-space index that answers jumbled pattern matching queries in O(1) time. Their key
observation was that if one substring of length i contains fewer than j 1s, and another substring
of length i contains more than j 1s, then there must be a substring of length i with exactly j 1s.
Using this observation, they construct an index that stores the maximum and minimum number
of 1s in any i-length substring, for each possible i. Burcsi et al. [9] (see also [10,11]) and Moosa
and Rahman [22] independently improved the construction time to $O(n^2/\log n)$, then Moosa and
Rahman [23] further improved it to $O(n^2/\log^2 n)$ in the RAM model. Currently, faster algorithms
than $O(n^2/\log^2 n)$ exist only when the string compresses well using run-length encoding [3,19] or
when we are willing to settle for approximate indexes [14].
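
As an illustration of the min/max index described above, here is a minimal sketch of the straightforward quadratic-time construction together with the constant-time query. It is not the faster construction of [22,23]; the names `build_index` and `query` are hypothetical and chosen only for this example.

```python
def build_index(s: str):
    """For every length i, store the minimum and maximum number of 1s in a length-i substring."""
    n = len(s)
    prefix = [0] * (n + 1)  # prefix[k] = number of 1s in s[:k]
    for k, ch in enumerate(s):
        prefix[k + 1] = prefix[k] + (ch == "1")
    lo = [0] * (n + 1)
    hi = [0] * (n + 1)
    for i in range(1, n + 1):
        counts = [prefix[k + i] - prefix[k] for k in range(n - i + 1)]
        lo[i], hi[i] = min(counts), max(counts)
    return lo, hi


def query(index, i: int, j: int) -> bool:
    """Is there a substring of length i with exactly j 1s? Uses the interval property."""
    lo, hi = index
    return 1 <= i < len(lo) and lo[i] <= j <= hi[i]
```

For example, with `idx = build_index("10010")`, the call `query(idx, 3, 1)` succeeds because every substring of length 3 contains exactly one 1, while `query(idx, 3, 0)` fails.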

The natural extension of jumbled pattern matching from strings to trees is much harder. In
this extension, we are asked to determine whether a vertex-labeled input tree has a connected
subgraph where each letter occurs the same number of times as specified by the input query.
The difficulty here stems from the fact that a tree, as opposed to a string, can have an exponential
number of connected subgraphs. Hence, a sliding window approach becomes intractable. Indeed,
the problem is NP-hard [21], even if our query contains at most one occurrence of each letter [17].
It is not even fixed-parameter tractable when parameterized by the alphabet size [17]. The
fixed-parameter tractability of the problem was further studied when extending the problem from trees to
graphs [2,5,15,16]. In particular, the problem (also known as the graph motif problem) was recently
shown by Fellows et al. [17] to be polynomial-time solvable when the number of letters in the
alphabet as well as the treewidth of the graph are both fixed.

Our results. In this paper we extend the current state of the art for binary jumbled
pattern matching. Our results focus on trees, and tree-like structures such as grammars and bounded
treewidth graphs. The problem on such trees turns out to be much harder than on strings and
requires substantially different ideas and techniques.

• Trees: For a tree T of size n, we present an index of size O(n) bits that is constructed in
$O(n^2/\log^2 n)$ time and answers binary jumbled pattern matching queries in O(1) time. This
matches the performance of the best known index for binary strings. In fact, our index for
trees is obtained by multiple applications of an efficient algorithm for strings [23] under a more
careful analysis. This is combined with both a micro-macro [1] and centroid decomposition of
the input tree. Our index can also be used as an $O(n \cdot i/\log^2 n)$-time algorithm for the pattern
matching (as opposed to the indexing) problem, where i denotes the size of the pattern. Finally,
by increasing the space of our index to O(n log n) bits, we can output in O(log n) time a node
of T that is part of the pattern occurrence.

• Grammars: For a binary string S of length n derived by a grammar of size g, we show how
to construct in $O(g^{2/3} n^{4/3}/\log^{4/3} n)$ time an index of size O(n) bits that answers jumbled
pattern matching queries on S in O(1) time. The size of the grammar g can be as much as
exponentially smaller than n and is always at most O(n/log n). This means that our time
bound is $O(n^2/\log^2 n)$ even when S is not compressible. If S is compressible but with other
compression schemes such as the LZ-family, then we can transform it into a grammar-based
compression with little or no expansion [12,24].

• Bounded Treewidth Graphs: For a graph G with treewidth bounded by w, we show how
to improve on the $O(n^{O(w)})$ time algorithm of Fellows et al. [17] to an algorithm which runs
in $2^{O(w^3)} n + w^{O(w)} n^{O(1)}$ time. Thus, we show that for a binary alphabet, jumbled pattern
matching is fixed-parameter tractable when parameterized only by the treewidth. This result
extends easily to alphabets of constant sizes.

We present our results for trees, grammars, and bounded treewidth graphs in sections 2, 3 and 4 
respectively. 

2 Jumbled Pattern Matching on Trees 

In this section we consider the natural extension of binary jumbled pattern matching to trees.
Recall that in this extension we are given a tree T with n nodes, where each node is labeled by
either 1 or 0. We will refer to the nodes labeled 1 as black nodes, and the nodes labeled 0 as white
nodes. Our goal is to construct a data structure that on query (i, j) determines whether T contains
a connected subgraph with exactly i nodes, j of which are black. Such a subgraph of T is referred
to as a pattern and (i, j) is said to appear in T. The main result of this section is stated below.

Theorem 1. Given a tree T with n nodes that are colored black or white, we can construct in
$O(n^2/\log^2 n)$ time a data structure of size O(n) bits that given a query (i, j) determines in O(1)
time if (i, j) appears in T.

Notice that the bounds of Theorem 1 match the currently best bounds for the case where T is
a string [22,23]. This is despite the fact that a string has only $O(n^2)$ substrings while a tree can
have $\Omega(2^n)$ connected subgraphs. The following lemma indicates an important property of string
jumbled pattern matching that carries over to trees. It gives rise to a simple index described below.

Lemma 1. If $(i, j_1)$ and $(i, j_2)$ both appear in T, then for every $j_1 \le j \le j_2$, (i, j) appears in T.

Proof. Let j be an arbitrary integer with $j_1 \le j \le j_2$, and let $T_1$ and $T_2$ be two patterns in T
corresponding to $(i, j_1)$ and $(i, j_2)$ respectively. The lemma follows from the fact that there exists
a sequence of patterns starting with $T_1$ and ending with $T_2$ such that every pattern has exactly
i nodes and two consecutive patterns differ by removing a leaf from the first pattern and adding
a different node instead. This means that the number of black nodes in two consecutive patterns
differs by at most 1, so some pattern in the sequence has exactly j black nodes. □

2.1 A Simple Index. 

As in the case of strings, the above lemma suggests an O(n)-size data structure: For every i =
1, …, n, store the minimum and maximum values $i_{\min}$ and $i_{\max}$ such that $(i, i_{\min})$ and $(i, i_{\max})$
appear in T. This way, upon query (i, j), we can report in constant time whether (i, j) appears in
T by checking if $i_{\min} \le j \le i_{\max}$. However, while $O(n^2)$ construction time is trivial for strings (for
every i = 0, …, n, slide a window of length i through the text in O(n) time) it is harder on trees.

To obtain $O(n^2)$ construction time, we begin by converting our tree into a rooted binary tree.
We arbitrarily root the tree T. To convert it to a binary tree, we duplicate each node with more
than two children as follows: Let v be a node with children $u_1, \ldots, u_k$, $k \ge 3$. We replace v with
$k - 1$ new nodes $v_1, \ldots, v_{k-1}$, make $u_1$ and $u_2$ be the children of $v_1$, and make $v_{\ell-1}$ and $u_{\ell+1}$ be the
children of $v_\ell$ for each $\ell = 2, \ldots, k - 1$. We call the nodes $v_2, \ldots, v_{k-1}$ dummy nodes. This procedure
at most doubles the size of T. To avoid cumbersome notation, we henceforth use T and n to denote
the resulting binary rooted tree and its number of nodes. For a node v, we let $T_v$ denote the
subtree of T rooted at v (i.e. the connected subgraph induced by v and all its descendants).

Next, in a bottom-up fashion, we compute for each node v of T an array $A_v$ of size $|T_v| + 1$. The
entry $A_v[i]$ will store the maximum number of black nodes that appear in a connected subgraph
of size i that includes v and another $i - 1$ nodes in $T_v$. Computing the minimum (rather than
maximum) number of black nodes is done similarly. Throughout the execution, we also maintain a
global array A such that A[i] stores the maximum $A_v[i]$ over all nodes v considered so far. Notice
that at the end of the execution, A[i] holds the desired value $i_{\max}$ since every connected subgraph
of T of size i includes some node v and $i - 1$ nodes in $T_v$.

We now show how to compute $A_v[i]$ for a node v and a specific value $i \in \{1, \ldots, |T_v|\}$. If v has
a single child u then v is necessarily not a dummy node and we set $A_v[i] = col(v) + A_u[i - 1]$,
where col(v) = 1 if v is black and col(v) = 0 otherwise. If v has two children u and w then
any pattern of size i that appears in $T_v$ and includes v is composed of v, a pattern of size $\ell$
in $T_u$ that includes u, and a pattern of size $i - 1 - \ell$ in $T_w$ that includes w. We therefore set
$A_v[i] = col(v) + \max_{0 \le \ell \le i-1}\{A_u[\ell] + A_w[i - 1 - \ell]\}$, and $A_v[i] = \max_{1 \le \ell \le i-1}\{A_u[\ell] + A_w[i - 1 - \ell]\}$
when v is a dummy node. Observe that in the latter the $\ell$ index starts with 1 to indicate that the
non-dummy copy of v is already included in the pattern.
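
The bottom-up computation can be illustrated with the following minimal sketch. It makes two assumptions that are not part of the text above: the tree is given as adjacency lists `adj` with 0/1 labels in `color`, and nodes with several children are handled by folding the children in one at a time, which plays the same role as the dummy-node conversion without building the binary tree explicitly. The function names are hypothetical.

```python
def max_black_by_size(adj, color, root=0):
    """best[i] = maximum number of black nodes in a connected subgraph with exactly i nodes."""
    n = len(adj)
    NEG = float("-inf")
    parent = [-1] * n
    order, stack = [], [root]
    while stack:  # iterative traversal; parents appear before their children in `order`
        v = stack.pop()
        order.append(v)
        for u in adj[v]:
            if u != parent[v]:
                parent[u] = v
                stack.append(u)

    best = [NEG] * (n + 1)
    best[0] = 0
    A = [None] * n  # A[v][i] = max black nodes in a size-i pattern in T_v that includes v
    for v in reversed(order):  # children are processed before their parent
        cur = [NEG, color[v]]  # a pattern that includes v has at least one node
        for u in adj[v]:
            if u == parent[v]:
                continue
            child = [0] + A[u][1:]  # size 0 means: take nothing from T_u
            merged = [NEG] * (len(cur) + len(child) - 1)
            for a in range(1, len(cur)):
                for b in range(len(child)):
                    merged[a + b] = max(merged[a + b], cur[a] + child[b])
            cur = merged
            A[u] = None  # no longer needed for future computations
        A[v] = cur
        for i in range(1, len(cur)):
            best[i] = max(best[i], cur[i])
    return best


def min_black_by_size(adj, color, root=0):
    """The minimum is obtained by maximizing the number of white nodes instead."""
    whites = max_black_by_size(adj, [1 - c for c in color], root)
    return [i - w for i, w in enumerate(whites)]
```

On the path with labels 1, 0, 1 (that is, `adj = [[1], [0, 2], [1]]`), the maxima for sizes 0 to 3 are 0, 1, 1, 2 and the minima are 0, 0, 1, 2.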

Lemma 2. The above algorithm runs in $O(n^2)$ time.

Proof. The computation done on a node with one child requires O(n) time, hence the total time
required to compute all arrays $A_v$ for such nodes is $O(n^2)$. The time required to compute all arrays
for nodes with two children is asymptotically bounded by the sum $\sum_v \alpha(v)\beta(v)$, where $\alpha(v)$ and
$\beta(v)$ denote the sizes of the two subtrees rooted at each of the children of v, and the sum is taken
over all nodes v with two children. For a tree rooted at r, we let cost(r) denote this sum over all
nodes in $T_r$ and argue by induction that cost(r) is bounded by $|T_r|^2 = O(n^2)$.

Let r be the root of a tree with n nodes, and let u and v denote the two children of r. Let
x denote the size of the subtree rooted at u. Then x < n, and the size of the subtree rooted at
v is $n - 1 - x$. By induction, we have $cost(u) \le x^2$ and $cost(v) \le (n - 1 - x)^2$. Thus, $cost(r) =
x(n - 1 - x) + cost(u) + cost(v) \le n^2 - x(n - x) \le n^2$. □
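
The inductive step can also be checked directly. Writing $y = n - 1 - x$ for the size of the subtree rooted at v (the name y is introduced here only for brevity), the three terms combine as

$$cost(r) \le xy + x^2 + y^2 = (x + y)^2 - xy \le (n - 1)^2 \le n^2.$$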

Note that if at any time the algorithm only stores arrays $A_v$ which are necessary for future
computations, then the total space used by the algorithm is O(n). The space can be made O(n)
bits by storing the $A_v$ arrays in a succinct fashion (this will also turn out useful later for improving
the running time): Observe that $A_v[i]$ is either equal to $A_v[i - 1]$ or to $A_v[i - 1] + 1$. This is because any
pattern of size i with b black nodes can be turned into a pattern of size $i - 1$ with at least $b - 1$ black
nodes by removing a leaf. We can therefore represent $A_v$ as a binary string $B_v$ of n bits, where
$B_v[0] = 0$, and $B_v[i] = A_v[i] - A_v[i - 1]$ for all i = 1, …, n − 1. Notice that since $A_v[i] = \sum_{\ell=0}^{i} B_v[\ell]$,
each entry of $A_v$ can be retrieved from $B_v$ in O(1) time using rank queries [20].
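
The increment encoding can be illustrated by the short sketch below. It assumes an array A with A[0] = 0 whose consecutive entries differ by 0 or 1, packs the increments into a Python integer used as a bit vector, and recovers entries with a population count. The population count here is a plain stand-in for the constant-time rank structure of [20], and the names `pack_increments` and `entry` are hypothetical.

```python
def pack_increments(A):
    """Encode A (with A[0] == 0 and steps of 0 or 1) as a bit vector: bit i is A[i] - A[i-1]."""
    bits = 0
    for i in range(1, len(A)):
        step = A[i] - A[i - 1]
        if step not in (0, 1):
            raise ValueError("consecutive entries must differ by 0 or 1")
        bits |= step << i
    return bits


def entry(bits, i):
    """Recover A[i] as the number of set bits among positions 0..i (a rank query)."""
    mask = (1 << (i + 1)) - 1
    return bin(bits & mask).count("1")
```

For instance, the array 0, 1, 1, 2, 3, 3 is stored in six bits, and `entry(bits, 4)` returns 3.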

2.2 Pattern Matching 

Before improving the above algorithm, we show that it can already be analyzed more carefully to
get a bound of $O(n \cdot i)$ when the pattern size is known to be at most i. This is useful for the pattern
matching problem: Without preprocessing, decide whether a given pattern (i, j) appears in T.

In the case of strings, this problem can trivially be solved in O(n) time by sliding a window
of length i through the string thus effectively considering every substring of length i. This
sliding-window approach however does not extend to trees since we cannot afford to examine all connected
subgraphs of T. We next show that, in trees, searching for a pattern of size i can be done in $O(n \cdot i)$
time by using our above indexing algorithm. This is useful when the pattern is small (i.e., when
i = o(n)). Obtaining O(n) time remains our main open problem.

Lemma 3. Given a tree T with n nodes that are colored black or white and a query pattern (i, j),
we can check in $O(n \cdot i)$ time and O(n) space if T contains the pattern (i, j).

Proof. In our indexing algorithm, every node v computes an array $A_v$ of size $|T_v|$. When the
pattern size is known to be i we can settle for an array $A_v$ of size $\min\{|T_v|, i\}$. Recall from the
above discussion that we can assume T is a binary tree. Consider some node v that has only one
child u. We can compute $A_v$ from $A_u$ in time $O(\min\{|T_u|, i\}) = O(i)$. Summing over all such nodes
v gives at most $O(n \cdot i)$. If on the other hand, node v has two children u and w then $A_v$ is computed
from $A_u$ and $A_w$ in $O(\min\{|T_u|, i\} \cdot \min\{|T_w|, i\})$ time. We claim that summing this term over all
nodes in T that have two children gives $O(n \cdot i)$.

To see this, first consider the subset of nodes $V = \{v : |T_v| \le i \text{ and } |T_{parent(v)}| > i\}$, where
parent(v) denotes the parent of v in T. Notice that each subtree $T_v \in \{T_v : v \in V\}$ is of size at most
i and that these subtrees are disjoint. By Lemma 2 we know that computing $A_v$ (along with every
$A_u$ for vertices $u \in T_v$) is done in $O(|T_v|^2)$ time. The total time to compute $A_v$ for all nodes $v \in V$
and their descendants is therefore $cost(V) = \sum_{v \in V} |T_v|^2$. Since every $|T_v| \le i$ and $\sum_{v \in V} |T_v| \le n$,
we have that cost(V) is upper bounded by $O(n \cdot i)$ that is achieved when all $|T_v|$s are equal to i and
$|V| = n/i$.

The remaining set of nodes S consists of all nodes v such that v has two children u, w and
$|T_v| > i$. We partition these nodes into $S_1 = \{v \in S : |T_u| > i \text{ and } |T_w| > i\}$ and $S_2 = S \setminus S_1$. Notice
that $|S_1| = O(n/i)$. Therefore, computing $A_v$ for all nodes $v \in S_1$ can be done in $O(|S_1| \cdot i^2) = O(n \cdot i)$
time. We are left only with the vertices of $S_2$. These are all vertices v such that at least one of their
children is in V. Denote this child as d(v). Computing $A_v$ for all nodes in $S_2$ can therefore be done
in time

$$\sum_{v \in S_2} O(|T_{d(v)}| \cdot i) = i \cdot \sum_{v \in S_2} O(|T_{d(v)}|) = i \cdot \sum_{u \in V} O(|T_u|) = O(i \cdot n).$$

□
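
A minimal sketch of pattern matching with truncated arrays is given below. As assumptions made only for this illustration, the tree is given as adjacency lists `adj` with 0/1 labels in `color`, children are merged one at a time rather than through dummy nodes, and the names `max_black_exact` and `appears` are hypothetical. The loop bounds mirror the $\min\{\cdot, i\}$ terms in the proof above, and the final test relies on Lemma 1.

```python
def max_black_exact(adj, color, cap, root=0):
    """Max number of black nodes over connected subgraphs with exactly `cap` nodes.

    Assumes adj describes a tree and 1 <= cap <= len(adj).
    """
    NEG = float("-inf")
    n = len(adj)
    parent, order, stack = [-1] * n, [], [root]
    while stack:
        v = stack.pop()
        order.append(v)
        for u in adj[v]:
            if u != parent[v]:
                parent[u] = v
                stack.append(u)

    table = [None] * n  # table[v] has min(|T_v|, cap) + 1 entries
    best = NEG
    for v in reversed(order):
        cur = [NEG, color[v]]
        for u in adj[v]:
            if u == parent[v]:
                continue
            child = [0] + table[u][1:]  # size 0 means: take nothing from T_u
            size = min(cap, len(cur) + len(child) - 2)  # never keep sizes above cap
            merged = [NEG] * (size + 1)
            for a in range(1, len(cur)):
                for b in range(min(len(child), size - a + 1)):
                    merged[a + b] = max(merged[a + b], cur[a] + child[b])
            cur = merged
            table[u] = None
        table[v] = cur
        if len(cur) > cap:
            best = max(best, cur[cap])
    return best


def appears(adj, color, i, j):
    """Decide whether some connected subgraph has exactly i nodes, j of them black."""
    if not 1 <= i <= len(adj):
        return False
    hi = max_black_exact(adj, color, i)
    lo = i - max_black_exact(adj, [1 - c for c in color], i)  # minimize blacks = maximize whites
    return lo <= j <= hi
```

For the path with labels 1, 0, 1, the call `appears(adj, color, 2, 1)` succeeds while `appears(adj, color, 2, 2)` does not, since every two-node subpath contains the white middle node.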

2.3 An Improved Index 

In this subsection, we will gradually improve the construction time from $O(n^2)$ to $O(n^2/\log^2 n)$.
For simplicity of the presentation, we will assume the input tree T is a rooted binary tree. This
extends to arbitrary trees using a similar dummy-nodes trick as above.

From trees to strings. Recall that we can represent every $A_v$ by a binary string $B_v$ where $B_v[i] =
A_v[i] - A_v[i - 1]$. We begin by showing that if v has two children u, w then the computation of $B_v$ can
be done by solving a variant of jumbled pattern matching on the string $S_v = X_v \circ col(v) \circ Y_v$ (here $\circ$
denotes concatenation) of length $|S_v| = |X_v| + |Y_v| + 1$, where $X_v$ is obtained from $B_u$ by reversing it
and removing its last bit, and $Y_v$ is obtained from $B_w$ by removing its first bit. We call the position in
$S_v$ with col(v) the split position of $S_v$. Recall that $A_v[i] = col(v) + \max_{0 \le \ell \le i-1}\{A_u[\ell] + A_w[i - 1 - \ell]\}$.
This is equal to the maximum number of 1s in a window of $S_v$ that is of length i and includes the
split position of $S_v$.
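
The correspondence between the recurrence and windows through the split position can be checked on small inputs with the sketch below. The helper names (`split_string`, `max_ones_through_split`, `direct_formula`) are hypothetical, the arrays are assumed to start with the entry for size 0, and the window scan is the naive one rather than the table-based algorithm of [23].

```python
def increments(A):
    """The increment string of A without its first bit: A[1]-A[0], A[2]-A[1], ..."""
    return [A[k] - A[k - 1] for k in range(1, len(A))]


def split_string(A_u, A_w, col_v):
    """Build S_v = X_v . col(v) . Y_v and return it with the index of the split position."""
    X = increments(A_u)[::-1]  # reversed, so the increment for size 1 is next to the split
    Y = increments(A_w)
    return X + [col_v] + Y, len(X)


def max_ones_through_split(S, split, i):
    """Maximum number of 1s in a length-i window of S that contains position `split`."""
    prefix = [0]
    for bit in S:
        prefix.append(prefix[-1] + bit)
    first = max(0, split - i + 1)
    last = min(split, len(S) - i)
    return max(prefix[s + i] - prefix[s] for s in range(first, last + 1))


def direct_formula(A_u, A_w, col_v, i):
    """col(v) + max over l of A_u[l] + A_w[i - 1 - l], restricted to valid indices."""
    return col_v + max(
        A_u[l] + A_w[i - 1 - l]
        for l in range(i)
        if l < len(A_u) and i - 1 - l < len(A_w)
    )
```

With `A_u = [0, 1, 1, 2]`, `A_w = [0, 0, 1]` and a black node v, the string is 1 0 1 1 0 1 with split position 3, and both functions agree for every window length from 1 to 6.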

We are therefore interested only in windows including the split position, and this is the important
distinction from the standard jumbled pattern matching problem on strings. Clearly, using
the fastest $O(n^2/\log^2 n)$-time algorithm [23] for the standard string problem we can also solve
our problem and compute $A_v$ in $O(|S_v|^2/\log^2 n)$ time. However, recall that for our total analysis
(over all nodes v) to give $O(n^2/\log^2 n)$ we need the time to be $O(|X_v| \cdot |Y_v|/\log^2 n)$ and not
$O((|X_v| + |Y_v|)^2/\log^2 n)$.

First speedup. The $O(\log^2 n)$-factor speedup for jumbled pattern matching on strings [23] is
achieved by a clever combination of lookup tables. One log factor is achieved by computing the
maximum 1s in a window of length i only when i is a multiple of s = (log n)/6. Using a lookup
table over all possible windows of length s, a sliding window of size i can be extended in O(1) time
to all windows of sizes i + 1, …, i + s − 1 that start at the same location (see [23] for details).

Their algorithm can output in $O(n^2/\log n)$ time an array of O(n/log n) words. For each i that is
a multiple of s, the array keeps one word storing the maximum number of 1s over all windows of
length i and another word storing the binary increment vector for the maximum number of 1s in
all windows of length i + 1, …, i + s − 1.

By only considering windows that include the split position of $S_v$, this idea easily translates to
an $O(|X_v| \cdot |Y_v|/\log n)$-time algorithm to compute $A_v$ and implicitly store it in $O((|X_v| + |Y_v|)/\log n)$
words. From this it is also easy to obtain an $O((|X_v| + |Y_v|)/\log n)$-words representation of $B_v$. Notice
that if v has a single child then the same procedure works with $|X_v| = 0$ in time $O(|Y_v|/\log n) =
O(n/\log n)$. Summing over all nodes v, we get an $O(n^2/\log n)$-time solution for binary jumbled
indexing on trees.

Second speedup. In strings, an additional logarithmic improvement shown in [23] can be obtained
as follows: When sliding a window of length i (i is a multiple of s) the window is shifted s locations
in O(1) time using a lookup table over all pairs of binary substrings of length < s (representing the
leftmost and rightmost bits in all these s shifts). This further improvement yields an
$O(n^2/\log^2 n)$-time algorithm for strings. In trees however this is not the case. While we can compute $A_v$ in
$O((|X_v| + |Y_v|)^2/\log^2 n)$ time, we can guarantee $O(|X_v| \cdot |Y_v|/\log^2 n)$ time only if both $|X_v|$ and
$|Y_v|$ are greater than s. Otherwise, say $|X_v| < s$ and $|Y_v| > s$, we will get $O(|X_v| \cdot |Y_v|/(|X_v| \log n)) =
O(|Y_v|/\log n)$ time. This is because our windows must include the col(v) index and so we never
shift a window by more than $|X_v|$ locations. Overcoming this obstacle is the main challenge of
this subsection. It is achieved by carefully ensuring that the $O(|Y_v|/\log n) = O(n/\log n)$ costly
constructions will be done only O(n/log n) times.

A micro-macro decomposition. A micro-macro decomposition [1] is a partition of T into
O(n/log n) disjoint connected subgraphs called micro trees. Each micro tree is of size at most
log n, and at most two nodes in a micro tree are adjacent to nodes in other micro trees. These
nodes are referred to as top and bottom boundary nodes. The top boundary node is chosen as the
root of the micro tree. The macro tree is a rooted tree of size O(n/log n) whose nodes correspond
to micro trees as follows (see Fig. 1): The top boundary node t(C) of a micro tree C is connected
to a boundary node in the parent micro tree parent(C) (apart from the root). The boundary node
t(C) might also be connected to a top boundary node of a child micro tree child(C).^1 The bottom
boundary node b(C) of C is connected to top boundary nodes of at most two child micro trees $\ell(C)$
and r(C) of C.

A bottom-up traversal of the macro tree. With each micro tree C we associate an array $A_C$.
Let $T_C$ denote the union of micro tree C and all its descendant micro trees. The array $A_C$ stores the
maximum 1s (black nodes) in every pattern that includes the boundary node t(C) and other nodes
of $T_C$. We also associate three auxiliary arrays: $A_b$, $A_t$ and $A_{tb}$. The array $A_b$ stores the maximum
1s in every pattern that includes the boundary node b(C) and other nodes of C, $T_{\ell(C)}$, and $T_{r(C)}$.
The array $A_t$ stores the maximum 1s in every pattern that includes the boundary node t(C) and
other nodes of C and $T_{child(C)}$. Finally, the array $A_{tb}$ stores the maximum 1s in every pattern that
includes both boundary nodes t(C) and b(C) and other nodes of C, $T_{\ell(C)}$, and $T_{r(C)}$.

^1 The root of the macro tree is unique as it might have a top boundary node connected to two child micro trees. We
focus on the other nodes. Handling the root is done in a very similar way.






Fig. 1. A micro tree C and its neighboring micro trees in the macro tree. Inside each micro tree, the black nodes 
correspond to boundary nodes and the white nodes to non-boundary nodes. 



We initialize for every micro tree C its O(|C|) = O(log n) sized arrays. Arrays $A_C$ and $A_t$ are
initialized to hold the maximum 1s in every pattern that includes t(C) and nodes of C. This can
be done in $O(|C|^2)$ time for each C by rooting C at t(C) and running the algorithm from the
previous subsection. Similarly, we initialize the array $A_b$ to hold the maximum 1s in every pattern
that includes b(C) and nodes of C. The array $A_{tb}$ is initialized as follows: First we check how many
nodes are 1s and how many are 0s on the unique path between t(C) and b(C). If there are i 1s
and j 0s we set $A_{tb}[k] = 0$ for every k < i + j and we set $A_{tb}[i + j] = i$. We compute $A_{tb}[k]$ for
all k > i + j in total $O(|C|^2)$ time by contracting the b(C)-to-t(C) path into a single node and
running the previous algorithm rooting C in this contracted node. The total running time of the
initialization step is therefore $O(n \cdot |C|^2/\log n) = O(n \log n)$ which is negligible. Notice that during
this computation we have computed the maximum 1s in all patterns that are completely inside a
micro tree. We are now done with the leaf nodes of the macro tree.

We next describe how to compute the arrays of an internal node C of the macro tree given
the arrays of $\ell(C)$, r(C) and child(C). We first compute the maximum 1s in all patterns that
include b(C) and vertices of $T_{\ell(C)}$ and $T_{r(C)}$. This can be done using the aforementioned string
speedups in $O(|T_{\ell(C)}| \cdot |T_{r(C)}|/\log^2 n)$ time when both $|T_{\ell(C)}| > \log n$ and $|T_{r(C)}| > \log n$ and in
O(n/log n) time otherwise. Using this and the initialized array $A_b$ of C (that is of size $|C| \le
\log n$) we can compute the final array $A_b$ of C in time $O((|T_{\ell(C)}| + |T_{r(C)}|)/\log n) = O(n/\log n)$.
Similarly, using the initialized $A_{tb}$ of C, we can compute the final array $A_{tb}$ of C in O(n/log n)
time. Next, we compute the array $A_t$ using the initialized array $A_t$ of C and the array $A_t$ of
child(C) in time O(n/log n). Finally, we compute $A_C$ of C using $A_{tb}$ of C and $A_t$ of child(C)
in $O((|T_{\ell(C)}| + |T_{r(C)}| + |C|) \cdot |T_{child(C)}|/\log^2 n)$ time if both $|T_{\ell(C)}| + |T_{r(C)}| + |C| > \log n$ and
$|T_{child(C)}| > \log n$ and in O(n/log n) otherwise. To finalize $A_C$ we must then take the entry-wise
maximum between the computed $A_C$ and $A_t$. This is because a pattern in $T_C$ may or may not
include b(C).

To bound the total time complexity over all clusters C, notice that some computations required
$O(\alpha(v) \cdot \beta(v)/\log^2 n)$ where $\alpha(v) > \log n$ and $\beta(v) > \log n$ are the subtree sizes of two children
of some node $v \in T$. We have already seen that the sum of all these terms over all nodes of T
is $O(n^2/\log^2 n)$. The other type of computations each require O(n/log n) time but there are at
most O(n/log n) such computations (O(1) for each micro tree) for a total of $O(n^2/\log^2 n)$. This
completes the proof of Theorem 1.

2.4 Finding the Query Pattern 

In this subsection we extend the index so that on top of identifying in O(1) time if a pattern (i, j)
appears in T, it can also locate in O(log n) time a node $v \in T$ that is part of such a pattern
appearance. We call this node an anchor of the appearance. This extension increases the space of
the index from O(n) bits to O(n log n) bits (i.e., O(n) words).

Recall that given a tree T we build in $O(n^2/\log^2 n)$ time an array A of size n = |T| where A[i]
stores the minimum and maximum values $i_{\min}$ and $i_{\max}$ such that $(i, i_{\min})$ and $(i, i_{\max})$ appear in
T. Now consider a centroid decomposition of T: A centroid node c in T is a node whose removal
leaves no connected component with more than n/2 nodes. We first construct the array A of T in
$O(n^2/\log^2 n)$ time and store it in node c. We then recurse on each remaining connected component.
This way every node $v \in T$ will compute the array corresponding to the connected component
whose centroid was v. Notice that this array is not the array $A_v$ since we do not insist the pattern
uses v. Observe that since each array A is implicitly stored in an n-sized bit array and since the
recursion tree is balanced the total space complexity is O(n log n) bits. Furthermore, the time to
construct all the arrays is bounded by $T(n) = 2T(n/2) + O(n^2/\log^2 n) = O(n^2/\log^2 n)$.

Let c denote the centroid of T whose removal leaves at most three connected components $T_1$,
$T_2$ and $T_3$ (recall we assume degree at most 3). Upon query (i, j) we first check the array of c if pattern
(i, j) appears in T (i.e., if $i_{\min} \le j \le i_{\max}$). If it does then we check the centroids of $T_1$, $T_2$ and $T_3$.
If (i, j) appears in any of them then we continue the search there. This way, after at most O(log n)
steps we reach the first node v whose connected component includes (i, j) but none of its child
components do. We return v as the anchor node since such a pattern must include v. Finally, we
note that the above can be extended to output all occurrences of (i, j).
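
The recursive structure used here can be sketched as follows. The code only builds the centroid decomposition; attaching the min/max array of each component to its centroid, and walking down the decomposition on a query, are left out. The tree is assumed to be given as adjacency lists `adj`, and the names `find_centroid` and `centroid_decomposition` are hypothetical.

```python
def find_centroid(adj, alive, start):
    """Return a centroid of the component of `start` among the nodes in `alive`."""
    parent = {start: None}
    order = [start]
    for v in order:  # the list grows while it is scanned (breadth-first order)
        for u in adj[v]:
            if u in alive and u != parent[v]:
                parent[u] = v
                order.append(u)
    size = dict.fromkeys(order, 1)
    for v in reversed(order):  # accumulate subtree sizes bottom-up
        if parent[v] is not None:
            size[parent[v]] += size[v]
    total = len(order)
    for v in order:
        heaviest = total - size[v]  # the component containing v's parent
        for u in adj[v]:
            if u in alive and parent[u] == v:
                heaviest = max(heaviest, size[u])
        if 2 * heaviest <= total:
            return v


def centroid_decomposition(adj):
    """Return (root, children): the top centroid and, for each centroid, its child centroids."""
    alive = set(range(len(adj)))
    children = {}

    def build(start):
        c = find_centroid(adj, alive, start)
        alive.discard(c)
        children[c] = [build(u) for u in adj[c] if u in alive]
        return c

    return build(0), children
```

On a path with seven nodes 0, …, 6 the top centroid is node 3 and its two child centroids are nodes 1 and 5, so the decomposition has depth three.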

3 Jumbled Pattern Matching on Grammars 

In grammar compression, a binary string S of length n is compressed using a context-free grammar
G(S) in Chomsky normal form that generates S and only S. Such a grammar has a unique parse
tree that generates S. Identical subtrees of this parse tree indicate substring repeats in S. The size
of the grammar g = |G(S)| is defined as the total number of variables and production rules in the
grammar. Note that g can be as much as exponentially smaller than n = |S|, and is always at most
O(n/log n). We show how to solve the jumbled pattern matching problem on S by solving it on
the parse tree of G(S), taking advantage of subtree repeats. We obtain the following bounds:

Theorem 2. Given a binary string S of length n compressed by a grammar G(S) of size g, we
can construct in $O(g^{2/3} n^{4/3}/(\log n)^{4/3})$ time a data structure of size O(n) bits that on query (i, j)
determines in O(1) time if S has a substring of length i with exactly j 1s.

Proof. We will show how to compute the array A such that A[i] holds the maximum number
of 1s in a substring of S of size i. The minimum is found similarly. We use a recent result of
Gawrychowski [18] who showed how for any $\ell$, we can modify G(S) in O(n) time by adding O(g)
new variables such that every new variable generates a string of length at most $\ell$, and S can be
written as the concatenation of substrings generated by these O(g) new variables. Thus, we can
write S as the concatenation of blocks $S = B_1 \circ \cdots \circ B_b$ with $b = O(n/\ell)$ and $|B_j| \le \ell$ such that
amongst these blocks there are only d = O(g) distinct blocks $B'_1, \ldots, B'_d$.

For each distinct block $B'_k$, $1 \le k \le d$, we first build an array $A'_k$ where $A'_k[i]$ stores the
maximum number of 1s over all substrings of $B'_k$ of length i. This is done in $O(\ell^2/\log^2 n)$ time per
block (by using the algorithm of [23] for strings) for a total of $O(g \cdot \ell^2/\log^2 n)$.

We next handle substrings that span over two adjacent blocks. Namely, for each possible pair of
distinct blocks $B'_k$ and $B'_m$, $1 \le k, m \le d$, we build a table $A'_{k,m}$ where $A'_{k,m}[i]$ stores the maximum
number of 1s over all substrings of $B'_k \circ B'_m$ of length i that start in $B'_k$ and end in $B'_m$. This is
done in $O(\ell^2/\log^2 n)$ time for each pair for a total of $O(g^2 \ell^2/\log^2 n)$. Recall that, since we use the
algorithm of [23], the table $A'_{k,m}$ is implicitly represented by an array of $O(\ell/\log n)$ words: For
each i that is a multiple of log n, the array keeps one word storing the maximum number of 1s over
all substrings of length i, and another word storing the binary increment vector for substrings of
length i + 1, …, i + log n − 1.

Finally, we consider substrings that span over more than two blocks. For each pair of (not
necessarily distinct) blocks $B_k$ and $B_m$, $1 \le k < m \le b$, we build a table $A_{k,m}$ where $A_{k,m}[i]$ stores
the maximum number of 1s over all substrings of $B_k \circ \cdots \circ B_m$ of length i that start in $B_k$ and
end in $B_m$. Notice that all such substrings are composed of (1) $S_{k,m} = B_{k+1} \circ \cdots \circ B_{m-1}$ that is of
length $i_{k,m} = |S_{k,m}|$ and has $j_{k,m}$ 1s, and (2) a suffix of $B_k$ and a prefix of $B_m$ whose total length
is $i - i_{k,m}$. Also note that we can easily compute $i_{k,m}$ and $j_{k,m}$ of all $1 \le k < m \le b$ in total time
$O(n^2/\ell^2)$. Then, for each $A_{k,m}$ we set $A_{k,m}[i]$ to be $j_{k,m} + A'_{k,m}[i - i_{k,m}]$, where $A'_{k,m}$ is the table
built above for the pair of distinct blocks equal to $B_k$ and $B_m$. In other words, we set
$A_{k,m}[i]$ to include the $j_{k,m}$ 1s of $S_{k,m}$ and the maximal number of 1s in a suffix of $B_k$ and a prefix
of $B_m$ whose total length is $i - i_{k,m}$. The computation of (an implicit representation of) each $A_{k,m}$
can be done in $O(\ell/\log n)$ time by only setting $A_{k,m}[i]$ for i's that are multiples of log n (the binary
increment vectors of $A_{k,m}$ remain as in $A'_{k,m}$). Since there are $O((n/\ell)^2)$ pairs of blocks and each
pair requires $O(\ell/\log n)$ time, we get a total of $O(n^2/(\ell \log n))$ time.

Finally, once we have the implicit representation of all $A_{k,m}$'s we can compute the desired array
A from them in $O(n^2/(\ell \log n))$ time: For each i that is a multiple of log n and each $A_{k,m}$ we set A[i]
to be the maximum out of $A_{k,m}[i]$ and A[i] in O(1) time. The next log n entries of A are computed
in O(1) time (as done in [23]) from the increment vectors of $A_{k,m}[i]$ and A[i]. To conclude, we get
a total running time of $O(g^2 \ell^2/\log^2 n + n^2/(\ell \log n)) = O(g^{2/3} n^{4/3}/(\log n)^{4/3})$ when $\ell$ is chosen to
be $(n/g)^{2/3}(\log n)^{1/3}$. □
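
The choice of block length comes from balancing the two terms of the running time. A short derivation, using only the quantities defined in the proof:

$$\frac{g^2 \ell^2}{\log^2 n} = \frac{n^2}{\ell \log n} \iff \ell^3 = \frac{n^2 \log n}{g^2} \iff \ell = (n/g)^{2/3} (\log n)^{1/3},$$

and substituting this value into either term gives

$$\frac{n^2}{\ell \log n} = \frac{n^2}{(n/g)^{2/3} (\log n)^{4/3}} = \frac{g^{2/3} n^{4/3}}{(\log n)^{4/3}}.$$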

We also note that similarly to the case of trees (subsection 2.4), if we are willing to increase our
index space to O(n log n) bits, then it is not difficult to turn indexes for detecting jumbled pattern
matches in grammars into indexes for locating them. To obtain this, we build an index for S and
recurse (build indexes) on $S_1 = B_1 \circ \cdots \circ B_k$ and $S_2 = B_{k+1} \circ \cdots \circ B_b$ where $|S_1|$ and $|S_2|$ are roughly
n/2. This way, like in the centroid decomposition for trees, we can get in O(log n) time an anchor
index of S. That is, an index of S that is part of a pattern appearance. Furthermore, as opposed
to trees, we can then find the actual appearance (not just the anchor) in additional O(i) time by
sliding a window of size i that includes the anchor.

4 Jumbled Pattern Matching on Bounded Treewidth Graphs 

In this section we consider the extension of binary jumbled pattern matching to the domain of 
graphs: Given a graph G whose vertices are colored either black and white, and a query (i, j). 
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determine whether G has a connected subgraph G' with i white vertices and j black vertices^. This 
problem is also known as the (binary) graph motif problem in the literature. Fellows et al. [17] 
provided an $n^{O(w)}$ algorithm for this problem, where $w$ is the treewidth of the input graph. Here
we will substantially improve on this result by proving the following theorem, asserting that the 
problem is fixed-parameter tractable in the treewidth of the graph. 

Theorem 3. Binary jumbled pattern matching can be solved in $f(w) \cdot n^{O(1)}$ time on graphs of
treewidth $w$.

The function $f(w)$ in the theorem above can be replaced with $w^{O(w)}$ in case a tree decomposition of
width $w$ (see below) is provided with the input graph, and otherwise it can be replaced by $2^{O(w^3)}$.
Also, the algorithm in the theorem actually computes all queries $(i, j)$ that appear in $G$, and can
thus be easily converted to an index for the input graph.

We begin by introducing some necessary notation and terminology. Let $G = (V(G), E(G))$
be a graph. A tree decomposition of $G$ is a tree $T$ whose nodes are subsets of $V(G)$, called bags,
with the following two properties: (i) the union of all subgraphs induced by the bags of $T$ is $G$, and
(ii) for any vertex $v$, the set of all bags including $v$ induces a connected subgraph in $T$. We use $\mathcal{X}$
to denote the set of bags in a given tree decomposition. The width of the decomposition is defined
as $\max_{X \in \mathcal{X}} |X| - 1$. The treewidth of $G$ is the smallest possible width of any tree decomposition
of $G$. Given a bag $X$ of a given rooted tree decomposition $T$, we let $T_X$ denote the subtree of $T$
rooted at $X$, and we let $G_X$ denote the subgraph induced by the union of all bags in $T_X$.
Bodlaender [7] gave an algorithm for computing a width-$w$ tree decomposition of a given graph
with treewidth $w$ in $2^{O(w^3)} n$ time.
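
The two defining properties and the width can be checked mechanically on small inputs. The sketch
below is illustrative only: the names `is_tree_decomposition` and `width`, and the encoding of bags
as a dictionary from bag identifiers to vertex sets, are assumptions of this example. It also
assumes that `tree_edges` already describes a tree and does not verify that.

```python
from collections import defaultdict, deque


def is_tree_decomposition(vertices, edges, bags, tree_edges):
    """Check properties (i) and (ii) for bags (dict: bag id -> set of vertices)
    connected by tree_edges (pairs of bag ids, assumed to form a tree)."""
    # (i) every vertex and every edge of G lies inside some bag.
    covered = set().union(*bags.values()) if bags else set()
    if covered != set(vertices):
        return False
    for u, v in edges:
        if not any(u in bag and v in bag for bag in bags.values()):
            return False

    # (ii) for each vertex, the bags containing it are connected in the tree.
    adj = defaultdict(set)
    for a, b in tree_edges:
        adj[a].add(b)
        adj[b].add(a)
    for v in vertices:
        holders = {t for t, bag in bags.items() if v in bag}
        start = next(iter(holders))
        seen = {start}
        queue = deque([start])
        while queue:
            t = queue.popleft()
            for nb in adj[t]:
                if nb in holders and nb not in seen:
                    seen.add(nb)
                    queue.append(nb)
        if seen != holders:
            return False
    return True


def width(bags):
    """Width of the decomposition: largest bag size minus one."""
    return max(len(bag) for bag in bags.values()) - 1
```

On the path a, b, c, the bags {a, b} and {b, c} joined by one tree edge pass both checks and have
width 1.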

We next describe the information we store for each bag in the tree decomposition of $G$. Let $X$
be an arbitrary bag. A partition $\Pi_X = \{X_0, X_1, \ldots, X_x\}$ of $X$ is positive for a given query $(i, j)$
in $G_X$ if there are $x$ disjoint connected subgraphs $G_1, \ldots, G_x$ of $G_X$ such that (1) the total number
of white and black vertices in $G' = G_1 \cup \cdots \cup G_x$ is $i$ and $j$ respectively, and (2) $V(G') \cap X_0 = \emptyset$
and $V(G_\ell) \cap X = X_\ell$ for each $\ell = 1, \ldots, x$. Here we slightly abuse our terminology and allow $X_0$ to
be the empty set. The information we compute for each bag $X$ is an array $A_X$ which has an entry
for each possible query $(i, j)$, where the entry $A_X[i, j]$ contains the set of all positive partitions of
$X$ for $(i, j)$ in $G_X$. Note that the query $(i, j)$ appears in $G_X$ iff there exists some partition into two
sets $\{X_0, X_1\}$ that is positive for $(i, j)$. Since $(i, j)$ appears in $G$ iff $(i, j)$ appears in $G_X$ for some
bag $X \in \mathcal{X}$, computing the arrays $A_X$ for all bags allows us to determine whether $(i, j)$ appears in $G$.
Our algorithm computes all arrays $A_X$ in a bottom-up fashion from the leaves to the root of $T$.
It is easy to verify that the size of each array $A_X$ is bounded by $w^{O(w)} n^2$, since there are $O(n^2)$
queries and $w^{O(w)}$ partitions of a bag. To get a similar term in our running time, we will show that
computing the array $A_X$ from the arrays of the children of $X$ can be done in polynomial time with
respect to the child array sizes.
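
When testing an implementation of the arrays on small graphs, a brute-force reference for the set
of queries that appear in $G$ can be useful. The sketch below is not the algorithm of this section:
it enumerates all vertex subsets and so takes exponential time. The function name
`appearing_queries` and the color encoding ('W' for white, 'B' for black) are assumptions of this
example.

```python
from itertools import combinations


def appearing_queries(colors, edges):
    """Return the set of all (i, j) such that the graph has a connected
    subgraph with i white and j black vertices.

    colors: dict mapping each vertex to 'W' or 'B' (hypothetical encoding).
    edges:  iterable of vertex pairs.
    """
    adj = {v: set() for v in colors}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)

    result = set()
    verts = list(colors)
    for k in range(1, len(verts) + 1):
        for subset in combinations(verts, k):
            chosen = set(subset)
            # Depth-first search restricted to the chosen vertices.
            stack = [subset[0]]
            seen = {subset[0]}
            while stack:
                u = stack.pop()
                for w in adj[u] & chosen:
                    if w not in seen:
                        seen.add(w)
                        stack.append(w)
            if seen == chosen:  # the chosen vertices induce a connected subgraph
                whites = sum(1 for v in chosen if colors[v] == 'W')
                result.add((whites, k - whites))
    return result
```

A vertex set supports a connected subgraph exactly when the subgraph it induces is connected, which
is why checking induced connectivity suffices here.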

We will work with a specific kind of tree decompositions, namely nice tree decompositions [8].
A nice tree decomposition is a binary rooted tree decomposition $T$ with four types of bags: leaf,
forget, introduce, and join. Leaf bags are the leaves of $T$ and include a single vertex of $G$, and
so computing $A_X$ for leaf bags is trivial. A forget bag $X$ has a single child $Y$ with $X = Y \setminus \{v\}$
for some vertex $v$ of $G$. Computing $A_X$ from $A_Y$ in this case amounts to converting each positive
partition $\Pi_Y$ of $Y$ to a corresponding positive partition $\Pi_X$ of $X$ by removing $v$ from the class it
belongs to in $\Pi_Y$. An introduce bag $X$ also has a single child $Y$, but this time we have $X = Y \cup \{v\}$
for some vertex $v \notin Y$ of $G$. By the properties of a tree decomposition, we know that $v$ is only
adjacent to vertices of $Y$ in $G_X$. Computing $A_X$ from $A_Y$ in this case requires the consideration of
all partitions of $X$ which are formed from positive partitions $\Pi_Y$ of $Y$ by adding $v$ to a class in $\Pi_Y$
that contains one of its neighbors (or adding $\{v\}$ as a new singleton class). We leave the precise
details to the full version of this paper, but it should be easy to see that computing $A_X$ in this case,
as well as in all cases above, can be done in $w^{O(w)} n^{O(1)}$ time.

^ The difference between the meaning of the query here and elsewhere in the paper is for ease of the presentation.

The more challenging case is when $X$ is a join bag. A join bag $X$ has two children $Y$ and $Z$
in $T$, with $X = Y = Z$. Consider two partitions $\Pi_Y = \{Y_0, \ldots, Y_y\}$ and $\Pi_Z = \{Z_0, \ldots, Z_z\}$ for $Y$
and $Z$. We define the partition $\Pi_Y \oplus \Pi_Z$ as follows: First we set $X_0$ to be $Y_0 \cap Z_0$. The remaining
classes are constructed such that any pair of vertices in $X$ belong to the same class in $\Pi_X \setminus \{X_0\}$
iff they belong to the same class in $\Pi_Y \setminus \{Y_0\}$ or to the same class in $\Pi_Z \setminus \{Z_0\}$.
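
The join of two partitions can be sketched with a union-find structure. This is a hedged
illustration rather than the paper's implementation: the function name `join_partitions` and the
encoding of a partition as a pair (the set $X_0$, a collection of the remaining classes) are
assumptions of this example. It also reads the "same class" condition as closed under
transitivity, so that classes linked through a shared vertex are merged.

```python
def join_partitions(y0, classes_y, z0, classes_z):
    """Combine two partitions of the same bag.

    y0, z0:               the classes Y_0 and Z_0 (sets of vertices).
    classes_y, classes_z: the remaining classes of each partition.
    Returns (X_0, set of frozenset classes) for the combined partition.
    """
    x0 = frozenset(set(y0) & set(z0))

    parent = {}

    def find(v):
        # Find the representative of v, with path halving.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb

    all_classes = [list(c) for c in classes_y] + [list(c) for c in classes_z]
    # Register every vertex that lies in some class other than Y_0 or Z_0.
    for members in all_classes:
        for v in members:
            parent.setdefault(v, v)
    # Vertices sharing a class in either partition end up in the same class.
    for members in all_classes:
        for v in members[1:]:
            union(members[0], v)

    groups = {}
    for v in parent:
        groups.setdefault(find(v), set()).add(v)
    return x0, {frozenset(g) for g in groups.values()}
```

For a bag {a, b, c, d}, joining $Y_0 = \{d\}$ with classes {a, b}, {c} and $Z_0 = \{a, d\}$ with class
{b, c} gives $X_0 = \{d\}$ and the single class {a, b, c}.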

Let $i_0$ and $j_0$ respectively denote the number of white and black vertices in $X$. We claim that if
$(i_1, j_1)$ and $(i_2, j_2)$ are two queries for which $\Pi_Y$ and $\Pi_Z$ are respectively positive in $G_Y$ and $G_Z$,
then $\Pi_X = \Pi_Y \oplus \Pi_Z$ is positive for $(i_1 + i_2 - i_0, j_1 + j_2 - j_0)$. This can be verified by considering
the connected components in $G^Y_1 \cup \cdots \cup G^Y_y \cup G^Z_1 \cup \cdots \cup G^Z_z$, where $G^Y_1, \ldots, G^Y_y$ and $G^Z_1, \ldots, G^Z_z$ are
sets of graphs witnessing that $\Pi_Y$ and $\Pi_Z$ are positive for $(i_1, j_1)$ in $G_Y$ and $(i_2, j_2)$ in $G_Z$. It is
easy to see that the total number of white and black vertices in these components is $i_1 + i_2 - i_0$ and
$j_1 + j_2 - j_0$, where $i_0$ white vertices and $j_0$ black vertices are subtracted due to double counting
the vertex colors in $X$. Moreover, it can be verified that these components intersect $X$ as required
by $\Pi_X$.

On the other hand, it can also be seen along the same lines that if $(i, j)$ is a query for which $\Pi_X$
is positive in $G_X$, then $(i, j) = (i_1 + i_2 - i_0, j_1 + j_2 - j_0)$ for some pair of queries $(i_1, j_1)$ and $(i_2, j_2)$
for which $\Pi_Y$ and $\Pi_Z$ are positive in $G_Y$ and $G_Z$. We can therefore compute $A_X[i, j]$ by examining
all such pairs $(i_1, j_1)$ and $(i_2, j_2)$, and computing the partition $\Pi_Y \oplus \Pi_Z$ for each pair of positive
partitions $\Pi_Y \in A_Y[i_1, j_1]$ and $\Pi_Z \in A_Z[i_2, j_2]$. This requires $w^{O(w)} n^{O(1)}$ time.

To summarize, we compute each array $A_X$ in $w^{O(w)} n^{O(1)}$ time. As the total number of bags is
$O(wn)$, we obtain an algorithm whose total running time is $w^{O(w)} n^{O(1)}$, excluding the time required
to compute the nice tree decomposition $T$. We note that the running time of our algorithm can
be slightly improved by using an extension of Lemma 1 to graphs. Also, our result straightforwardly
extends to a $w^{O(w)} n^{O(c)}$ time algorithm for the case where the vertices of $G$ are colored with $c$
colors.

5 Conclusions 

In this paper we considered the binary jumbled pattern matching problem on trees, bounded
treewidth graphs, and strings compressed by grammars. We gave an $O(g^{2/3} n^{4/3})$-time solution
for strings of length $n$ represented by grammars of size $g$, an $f(w) \cdot n^{O(1)}$-time solution for graphs
with treewidth $w$, and an $O(n^2/\log^2 n)$-time solution for trees. In the latter result, we showed how
to determine in $O(1)$ time if a query pattern appears, and how to locate in $O(\log n)$ time a node
of this appearance. Locating the entire appearance remains an open problem. Using Lemma 3, the
construction time for trees can be made $O(n \cdot i/\log^2 n)$ if the query patterns are known to be
of size at most $i$. We also note here that the construction time can be made faster on trees that
have many identical rooted subtrees. This is because the bottom-up construction does not need to
be applied on the same subtree twice. Finally, perhaps the main open problem stemming from our
work is to develop an algorithm for the non-indexing variant of binary jumbled pattern matching
on trees whose performance is closer to the performance of the corresponding algorithm on strings
(i.e., the $O(n)$ sliding window algorithm).